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Abstract 

We present and compare various approaches to a classical selection problem on Graphics Processing 
Units (GPUs). The selection problem consists in selecting the fc-th smallest element from an array of 
, size n, called k-th order statistic. We focus on calculating the median of a sample, the n/2-th order 

£^ ' statistic. We introduce a new method based on minimization of a convex function, and show its numerical 

^ , superiority when calculating the order statistics of very large arrays on GPUs. We outline an application 

' of this approach to efficient estimation of model parameters in high breakdown robust regression. 
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m . 

We consider a classical selection problem: calculate the A;-order statistic of a sample x6 3?' 
Calculation of the median of a sample is a special instance of the selection problem, 
with Med(x) = £([(„+i)/2]), where [t] denotes the integer part of t. Selection problem has 
various applications in data analysis, in particular in high-breakdown regression [[191 , [1271 , which 
is a motivating application discussed later in this paper. Selection problem is also valuable 
for other high-performance computing applications, such as image processing, computational 
aerodynamics and simulations, and it has received attention in parallel computing literature [Q], 

0, EDI, EH- 

Our concern is efficient calculation of the order statistics of very large vectors by offloading 
computations to GPUs. General purpose GPUs have recently become a powerful alternative to 
traditional CPUs, which allow one to offload various repetitive calculations to the GPU device. 
GPUs have many processor cores (p = 240 in NVIDIA' s Tesla CI 060 and p = 448 in C2050), and 
can execute thousands of threads concurrently |[23l. The computational paradigm here is SIMT 
(single-instruction multiple-thread). Different streaming multiprocessors can execute different 
instructions, although threads within groups (so called warps) are executed in SIMD (single 
instruction multiple data) fashion. The peak performance is achieved when there are no, or 
almost no branching due to conditional statements. 

The selection problem is related to sorting. A naive approach to the selection problem is to 
sort the elements of x and select the desired order statistic. On CPU, the complexity of the sort 
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operation is 0{n log n) by using one of the standard sorting algorithms. Radix sort has 0(nb) 
complexity where b is the number of bits in the key iBTfl . 

An improvement over sorting is to use a selection algorithm, such as quickselect ll24ll . OTTl . 
which is based on quicksort, and has an expected complexity of 0(n). The median of medians 
algorithm developed in 01, also known as BFPRT after the last names of the authors, has 0(n) 
worst case complexity, although empirical evidence suggests that on average quickselect is faster. 

The sort operation was parallelized for GPUs. Parallel sorting algorithms include GPU quick- 
sort ®, GPU sample sort ED, bucket-merge hybrid sort [Ml, GPU radix sort 11201. Il29l and 
others [fi2l. 03], ||32~1 . At the time of writing, GPU radix sort, which was implemented in Trust 
software IfTSl , and in the CUDA 4.0 library [1221 released in March 2011, is the most efficient 
GPU sort algorithm. In empirical tests it delivered performance of up to 580 million elements/sec 
(for 32 bit keys on Tesla C2050) f|29l , which was consistent with our own evaluation. There 
have been attempts to parallelize selection algorithms [Q]> ED, fl30ll , 11341 for coarse-grained 
multicomputers, and in the the framework of bulk- synchronous parallelism paradigm, but these 
methods use divergent threads of computation. To our knowledge there is no parallel version of 
a selection algorithm suitable for GPUs at the moment. 

In this contribution we analyze various alternatives for parallel calculation of the median of a 
large real vector (n > 10 5 ) on a GPU device, and propose a new parallel selection method. Rather 
than attempting to parallelize an existing serial selection algorithm, we propose an alternative 
method, based on minimization of a convex cost function. Calculation of the cost function is 
performed by reduction in parallel. By using an effective minimization algorithm, which requires 
only a few iterations, we design a cost efficient approach to parallel selection. 

The paper is structured as follows. In Section [I] we outline the approaches to calculating the 
median and order statistics in parallel and list our alternatives suitable for GPUs. In Section HIT] 
we present a new approach based on minimizing a convex function, or alternatively, on solving a 
nonlinear equation. Section JV] presents Kelley's cutting plane algorithm, which we found to be 
the most efficient and stable among the optimization methods we evaluated. Section |V] is devoted 
to comparison and benchmarking of different alternatives for selection on GPUs. In Section IVTl 
we outline two applications that require efficient selection algorithms, taken from the area of 
data analysis. Section IVTll concludes. 

II. Approaches to selection on GPU 

Consider a vector x G 9ft n , n of order of 10 5 — 10 9 . We are interested in efficient calculation 
of its median, or an order statistic. As we mentioned in the Introduction, this can be achieved 
by using either sorting or selection algorithms. 

Our problem has a specific feature. The vector x is stored as an array of floats (single or double 
precision) on the GPU device. The reason is that often the components of x are themselves 
calculated in parallel on GPU. This is the case in our motivating application. 

In addition, we consider a situation where x is so large that it does not fit the memory of a 
single GPU device nor CPU memory, and is distributed over several GPU devices. 

Let us list our alternatives. 

1) Perform parallel sorting on GPU by any available GPU sorting algorithm. In our study we 
used the fastest GPU radix sort (we benchmarked several GPU sorting algorithms, and the 
results are consistent with the conclusions by Merrill and Grimshaw [1201 '). 



3 



2) Transfer x to CPU and use quickselect algorithm. 

3) Use quickselect on GPU running as a single thread. 

4) Develop an alternative GPU selection algorithm. 

We note that when a large number of calculations of order statistics (in particular, the medians) 
of different vectors are required, even small gains in computational efficiency become important. 

All the listed alternatives seem to be feasible and competitive. The cost of alternative 2 includes 
the cost of copying data between GPU and CPU. The quickselect algorithm is very efficient on 
modem CPUs, as compared to other sorting and selection methods. Alternative 3 eliminates the 
need to transfer the data, but at the expense of much slower execution of quickselect on one of 
the GPU cores. 

When the data are distributed over several GPUs, alternatives 2 and 3 are no longer feasible, 
and GPU sorting algorithms may require substantial modifications and will degrade in perfor- 
mance, since the data needs to be moved between different GPU devices. However, the nature 
of our alternative selection algorithm makes it applicable to multiple GPU situation. 

III. Approach based on search and optimization 

It was known to Laplace and Gauss that the arithmetic mean, the median, the mode, and some 
other functions are solutions to simple optimization problems 0, ifTTI . lfT6ll . A recent account 
and generalizations can be found in fl6l, [|7], ll35l . In particular, the median is a minimizer of 
the following expression: 



n 




The objective is clearly non-smooth (it is piecewise linear), but it is convex, because the sum 
of convex functions is always convex. Convexity of the objective implies that there exists a 
unique minimum of the above expression, which can be easily found by a number of classical 
univariate optimization methods. Of course, numerical solution to problem (Q3 is more expensive 
than quickselect on CPUs, but taking into account that the data reside on the GPU, and that each 
evaluation of all the terms in the objective can be done in parallel in 0(j) operations (p is the 
number of processor cores), this method becomes a promising alternative. 

For calculation of a k— th order statistic, the following objective is used [|6), [|7) 

n 

0Sk(x) = argmin ) u(xi - y), (2) 
ii < ' 

where 

u(t) = 

The objective (0 is also piecewise linear convex. 

For the purposes of optimization, we considered the following alternatives: 1) Brent's method 
[|24l . 2) the golden section method, 3) an analog of the Newton's method for non-smooth 
functions (3), and 4) Kelley's cutting plane method [17J. 



j (n - k + \)t if t > 0, 
1 -{k-\)t if * < 0. 
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When the objective / is differentiable, a minimization problem can be converted into a root- 
finding problem: solve f'(y) = 0. For non-smooth functions we need to solve the following 
nonlinear equation 

o e df(y), 

where df is the Clarke's subdifferential [0, ifTOl . For a univariate function, Clarke's subdiffer- 
ential at a point t is simply a set of values df(t) = {zG 3?|V?/ G 3?, /(?/) > z(y — t) + /(£)}, 
such that any line with the slope z G which passes through (t,f(t)), is a tangent line to 

the graph of /. An element of the subdifferential is called a subgradient. 
We know that for the modulus function 

r {-i} if * <o, 

= < [-1,1] if t = 0, 
{ {1} if * > 0. 

We can express df using g : K -> / C S, where J is a set of intervals on a real line, 

g(y) = count(xi > y) + count(xi < y)(— 1) + count(xi = y)[— 1, 1], 

where count is the number of elements of x satisfying a given condition, and where the usual 
Minkowski sum for sets is used. 

There are many root-finding algorithms, which can all be adapted to equation G g(y). In 
particular, we adapted the classical bisection method, similar to how it was done in [fl3l but in 
a different context, as well as the method of parabolas combined with golden section, which is 
also known as Brent algorithm [24] . In fact this root finding algorithm is equivalent to Brent's 
optimization method, however we looked at both alternatives, as implementation details and 
stopping criteria may play a role in their efficiency. 

Evaluation of the objective in (Q~|) requires summation. On GPUs it can be performed in parallel 
by reduction, using binary trees in 0(- + logp) operations. Reduction on GPU is efficiently 
implemented and relies on various hardware related aspects and strategies, like loop unrolling 
ETI . Furthermore, calculation of (fl3 on GPU involves only reading from (slow) global memory 
in coalescent blocks, but no writing when swapping the elements of x, as in sorting algorithms 
(all temporary results are written into a much faster shared memory). As for Eq. ©, used for 
order statistics, it introduces only minimal branching in calculating each term of the sum. In 
the experiments we conducted, this did not produce any noticeable effect on the run time of the 
algorithm. 

IV. Kelley's cutting plane method 

The methods of function minimization and root finding we have used are classical, and are 
discussed in detail in several textbooks, e.g. [|24l . Kelley's cutting plane method IfTTl is less 
known. In addition, as we show in the subsequent sections, it was the most stable and efficient 
algorithm in our study. Therefore it is worth to outline Kelley's algorithm in this section. 

The cutting plane method is relying heavily on the convexity of the objective, and requires 
calculation of both the objective / and its subgradient g. This method iteratively builds a lower 
piecewise linear approximation to the objective. 
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At each iteration, the algorithm evaluates the value of the objective / and its subgradient g. 
A piecewise linear lower estimate of / is built using the formula 

h(y)= max {g(vi)(y - yt) + /(y<)}, 
i=i, ...,k 

where y^ are the points the objective was evaluated during the previous k iterations. The mini- 
mizer of h is chosen as the yk+i- The algorithm starts with y\ = xq) and y 2 = X( n ), which are 
evaluated on GPU in parallel using reduction. Figure |4] illustrates these steps. 

It is worth noting that the minimizer of h is always bracketed by two values y L and y R , 
initially y\ and y 2 , and is found explicitly by 

= f(yn) - Kvl) + VLg^VL) - Vr9{vr) 

k+1 g{vL)-g{yR) 

y L and y R are updated in the following way: if g(y k+1 ) < then y L «- y k+1 and if g(y k+1 ) > 
then y R <— y k +i- If g(y k+ \) = 0, the solution is found. 

Let us present Kelley's cutting plane algorithm for completeness. 

Algorithm 1 (Cutting plane algorithm): Input: / - the objective function, g - a subgradient of 
/, maxit - the upper bound on the number of iterations. 

Input/output: y L , y R - the ends of the interval containing the minimizer. 
Output: y - minimizer found. 

0. fL <- f(y L ),gL <- g{y L ), fR <- f(y R ),gR <- g{y R ). 

1. for i = 1; i < maxit; i — i + 1 do: 

1.1 t^(fR-fL + y L *gL-y R * gR)/(gL - gR). 

1.2 ft<-f{t),gt<-g(t). 

1.3 If Stopping criteria then y <— t and exit. 

1.4 If gt < 

then y L <-t,fL<- ft, gL <- gt. 
else y R 4-t, fR<- ft, gR <- gt. 

2. y <— t and exit. 

The stopping criteria at step 1.3 comprise the following: gt = (the point with G df(t) 
was found), y R — yi < tolerance/, \gt\ < tolerance g . 

The cutting plane algorithm in one dimension has little overhead, and both / and g are easily 
computed in parallel and simultaneously on a GPU device. Once an approximation y to the 
median is computed with the desired accuracy, a simple loop finds the exact value of the median 
in 0(~ + logp) operations on a GPU device by reduction [J. 

The cutting plane algorithm starts with calculation of / and g at the extremes of the range of 
the data (at step 0). The values yi = Vi = X(i) and y R = y 2 = £( n ) can be calculated in parallel 
by reduction. Let us also notice that the values of / and g at these points are computed by simple 
formulas: g(y L ) = -n + 2, g(y R ) = n - 2, f(y L ) = Y. x i~ n VL and f(y R ) = ny R - ^x t . 
The values yL, y R and YI x % can be computed in a single parallel reduction operation, which is 
advantageous compared to computations of yL,Vn, f{yi) and f{y R ) independently using four 
reductions. Therefore the complexity of the Algorithm 1 is at most maxit + 1 parallel reductions. 

'This is a simple loop which selects the largest element Xi < y. 
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To end this section, we notice that the cutting plane method takes only a few iterations to 
converge to an approximate solution y, under 30 iterations in our experiments with n up to 32 
million and tolerance f = 10~ 12 . Furthermore, we improved the runtime of the algorithm by 
combining it with parallel sorting as follows. First, we run the cutting plane method for very 
few iterations, say 5-7 iterations. After completing these iterations, we have an interval [yL,yp\ 
containing the mimimizer, which is the median of x. We think of this interval as a pivot in 
selection algorithms. At the second stage, we select the values Xj which fall within the (open) 
pivot interval ]y^, yn[ and copy them to a smaller array z. This can be done in parallel on GPU 
using copy_if operation. Then the elements of z are sorted in parallel using GPU radix sort. 
The median is then the k— th order statistic z^) with k — [~] — m, and m being the number 
of elements of x smaller than or equal to yL, recorded during execution of the cutting plane 
method. 

This way we obtain a hybrid selection algorithm, which benefits from the fact that sorting is 
performed on a much smaller array z. The number of iterations of the cutting plane method is 
selected to maximize efficiency: the algorithm stops when the cost of its iteration outweighs the 
benefit of further reducing the pivot interval for faster conditional copy and sorting. This number 
is empirically selected for a given architecture. In our experiments we stopped the cutting plane 
algorithm after 7 iterations (for n = 2 25 ), as at that time the pivot interval contained under 2 19 
elements, and its sorting was already very fast. 

V. Empirical study of computational efficiency 

A. Data sets 

We selected the following methods for numerical comparison: quickselect on CPU, quicks- 
elect on GPU as a single thread, GPU version of the radix sort ll29l , four methods based on 
minimization of (OQ) (Brent's, golden section, nonsmooth quasi-Newton and cutting plane) and 
two methods based on solving E g(y) (bisection and Brent's root finding algorithm). 

We randomly generated the following data sets of varying length n E {8192 = 2 13 , 32768 = 
2 15 , 131072 = 2 17 , 524288 = 2 19 , 2097152 = 2 21 , 8388608 = 2 23 , 33554432 = 2 25 , 134 x 10 6 « 
2 27 }: 

1) Uniform x { ~ U(0, 1) 

2) Normal x { ~ N(0, 1) 

3) Half-normal Xj = |^| and y { ~ iV(0, 1) 

4) Beta x { ~ /3(2, 5) 

5) Mixture 1, 66.6% of elements of x { chosen from N(0, 1) and 33.3% from iV(100, 1) 

6) Mixture 2 50% of elements of x« +1 chosen from iV(0, 1) and the rest from 7Y(100, 1) 

7) Mixture 3 90% of elements of X{ chosen from half-normal N(0, 1) and the rest set to 10. 

8) Mixture 4 66.6% of elements of Xi chosen from half-normal iV(0, 1) and 33.3% from 
iV(100,l) 

9) Mixture 5 50% of elements of Xi +1 chosen from half-normal iV(0, 1) and the rest from 

iV(100,l) 

In addition, we performed experiments in which one or more components of x took very large 
values ~ 10 9 . The reasons for our selection of the distributions are the following. The median is 
scale invariant, so the parameters of U and N are taken without loss of generality. We wanted 
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to test the algorithms for symmetric and asymmetric distributions, such as half-normal and beta, 
as well as for mixtures. The use of half-normal distributions is motivated by our background 
application, that of regression, which is described in the next section. If the data follow a linear 
model with normally distributed noise, the absolute residuals follow a half-normal distribution. 
If there are large outliers in the data, we will have a mixture of distributions for the residuals. We 
tested mixtures with both equal and unequal proportions of the components. These proportions 
correspond to the proportion of the outliers in our background application. The model with a 
mixture of half-normal and normal is the closest to our application, and in fact this distribution 
was observed when we used regression residuals as the data. 

B. Algorithms and testing environment 

Our preliminary analysis allowed us to exclude several algorithms. Golden section was inferior 
to Brent's method. In fact, Brent's algorithm from ll24l is organized in such a way that it uses the 
method of parabolas, and reverts to the golden section if the method of parabolas is unsuccessful, 
so it is always no worse than golden section. The quasi-Newton method was very unstable, and 
failed to converge in most cases. 

Thus we performed detailed numerical comparison of seven algorithms: quickselect, quick- 
select on GPU, GPU radix sort, Brent's method of optimization, cutting plane, bisection and 
Brent's method of solving nonlinear equations. 

We measured the average time of each algorithm over 10 instances of each data set for every 
fixed n, and repeated calculations for each instance 10 times. We excluded the time spent on 
loading the data set from a file and on transferring it to the GPU device. 

Our hardware was Tesla C2050 with p = 448 cores and 3 GB of RAM, connected to a four- 
core Intel i7 CPU with 4 GB RAM clocked at 2.8 GHz, running Linux (Fedora 12). Before 
presenting the results, we mention some of the key performance parameters. Transfer of a 32M 
array of floats/doubles from GPU to CPU on our system takes over 230/455 ms, while transfer 
of a 500K array takes only 4/6.1 ms. Radix sort on GPU takes 65ms for a 32M array of floats, 
but degrades to 226 ms for 32M array of doubles, which are similar values to those reported in 
11291 . One reduction on GPU when calculating objective © or its subgradient g took 3ms for a 
32M array of doubles (1.9 ms for floats). 

In our implementation of the algorithms, we relied on Thrust library lfl"5l . in particular, 
Thrust's implementation of radix sort, and the transformjreduce and copy_if functions. Thrust 
library offers a number of high-level algorithms, such as reduction and transformations with 
arbitrary unary and binary operators, and automatically selects the most appropriate configuration 
parameters (such as block size and number of blocks in a warp). Therefore we do not report 
experiments with these parameters, assuming that the optimal choice was made by Thrust. 

We outline the program code used to calculate the the objective / and its subgradient g needed 
by Algorithm 1 in Figured] The actual implementation of our algorithm called cp_select li- 
brary, can be downloaded from http : / /www . deakin . edu . au/~gleb/cp_select . html 
We want to stress the simplicity of the algorithm, which takes only a few lines of code relying on a 
high-level interface to standard algorithms, such as reduction and conditional copy, implemented 
in the Thrust library. 
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C. Empirical results 

We present the results of our numerical experiments in Tables U and [TT] for single and double 
precision respectively, and graphically in Figures 1213 1 (the plot is the log-log plot). We present 
the average results for all distributions, because we did not observe any significant deviations 
from the average values for different distributions of the data. In fact, for the most efficient GPU 
radix sort and the proposed method, CPU times were very consistent: the largest and the smallest 
CPU time were within less than 5% from the average values. For quickselect (both CPU and 
GPU versions) the variations were larger, but we did not observe any pattern related to the type 
of the distribution, only variations between the instances of data from the same distribution. 

We observe that the use of quickselect on the CPU was more efficient only for the smallest 
data set tested n = 8192, and that for larger data sets GPU radix sort was faster by a factor of 
thirteen and five for single and double precision respectively. Both copying the data and actual 
selection algorithm on CPU have contributed to the cost of quickselect on CPU. We see from the 
tables that the time spent on just copying the data to CPU was larger than that of GPU radix sort 
starting from n > 2 15 , and the same was true for the time used by quickselect algorithm itself. 
Execution of quickselect on GPU in a single thread was worse than the remaining alternatives 
by a very large factor. 

GPU radix sort was the most efficient method for up to n = 2 21 . At that point the methods 
based on minimization outperformed GPU radix sort and consistently remained more efficient for 
large data sets. It looks from the plots of the logarithms of CPU time, that starting from n = 2 23 
the CPU time behaved like 0(n) for all methods, and the difference was due to a constant 
factor. The difference in performance of GPU radix sort and the proposed method was more 
pronounced for double precision values, as one would have expected, because the performance 
of the radix sort depends on the number of bits in the key. 

Among the methods of minimization/root finding, bisection was the slowest. Brent's root 
finding algorithm delivered similar performance to the cutting plane method, but its performance 
degraded when data contained very large outliers, as explained below. 

The difference in CPU time between the fastest existing alternative GPU radix sort and our 
cutting plane method is almost six-fold for large arrays n = 2 27 (double) and three-fold (single 
precision). 



D. Discussion 

We note that unlike traditional sorting and selection algorithms, our methods based on mini- 
mization and root-finding are not sensitive to how the data are organized in the array x (e.g., at 
random, pre-sorted, or sorted in inverse order). Indeed expression © is invariant with respect 
to permutations of the components of x. 

However, the performance of all algorithms based on minimization or solving nonlinear equa- 
tions, except the cutting plane method, drastically decreased when just one or a few components 
of x took very large values (e.g., 10 9 ), which could happen in our application. We do not report 
the CPU times, because they depend on how large was the outlier. In fact we could slow down 
the mentioned algorithms to any degree by taking a sufficiently large outlier. That is, all of our 
methods except cutting plane, are sensitive to the values of X{. The reason is the following. 
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Clearly, the number of iterations of the bisection method required to achieve a fixed accuracy 
is O(logr), where r = xi n \ -im is the range of the data, which can be unbounded. The same is 
true for the golden section optimization method, which is the counterpart of the bisection. Both 
Brent's methods reverted to the slower bisection/golden section methods because the objective 
is linear in almost all of the range of the data when Xi have such an uneven distribution. Hence 
the parabolic fits attempted by Brent's method are unsuccessful, see Figure |5] 

The only optimization method insensitive to very large (or small) values of x,- L was Kelley's 
cutting plane method. The reason this method was successful is that it uses not only the values 
of the objective but also its subgradients, as well as convexity of the objective. The cutting plane 
method iteratively builds a lower piecewise linear approximation to the objective, which also 
happens to be piecewise linear. In that way, just one value of the objective and its subgradient 
allows the algorithm to eliminate large uninteresting linear pieces (such as the intervals between 
a large outlier Xi and the bulk of the data), which do not contain the minimizer. 

However, even the cutting plane method had problems when some components of x were 
extremely large, of order of 10 20 , because of finite precision of machine arithmetic. When such 
a large value is accounted for in the sum (0Q), all the subsequent terms do not contribute to the 
sum because of loss of precision, even though their combined value can be sufficiently large. 
This is a well known effect in numerical analysis. 

To circumvent this issue we used the following approach. The order statistics, and the median 
in particular, are invariant with respect to monotone transformations. Then we can transform the 
elements of x by applying an increasing function F componentwise F(x) = (F(xi), . . . , F(x n )), 
calculate the median of the transformed array medp = Mec?(F(x)), and then take med = 
F^imedp). In our work we used transformation F(t) = log(l + t — xm). In this case the loss 
of accuracy in summation is eliminated. 

As we mentioned in Section [IV] the complexity of the cutting plane algorithm is maxit + 1 
parallel reductions, plus the time spent on conditional copy and sorting a reduced array z. In all 
our experiments, the array z turned out to be significantly smaller than x, typically between to 
1% and 5% of the length of x. Even if, hypothetically, z is almost as large as x, the complexity 
of the proposed method is tied to the complexity of the GPU radix sort, because the use of 
cutting plane adds only a fixed initial overhead. We were unable to design a data set which 
would lead to such a hypothetical case. 

Finally, we will mention the situation where the data are distributed across several GPUs. 
Parallel sorting algorithms require transfers of data between GPUs, which has a non-negligible 
cost, even when such transfers do not involve the CPU, and unified virtual addressing is used, as 
in CUDA 4.0. In contrast, calculation of CQ) and its subgradient is embarrassingly parallel, and 
involves reductions executed independently on different GPUs. The partial sums from several 
GPUs are added together on the CPU. It works in a similar way for multiple GPUs connected 
to different hosts, which are in turn connected through MPI interface, as only small portions of 
data need to be transferred on few occasions (of course, we assume that the number of GPUs is 
not large). Therefore we see the approach based on minimization as the most suitable alternative 
to selection on multiple GPUs 
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VI. Application to robust regression 

Next we outline our motivating problem of high breakdown regression, the role of the median 
function and our GPU based algorithm. We also mention another application from data analysis 
which requires multiple calculations of the order statistics. 

We consider the classical linear regression problem: given a set of pairs {(xi, Hi)}, i — 1, • • • , n: 
Xi G ffl 3 , yi G 3i (data), and a set of linear models f e : 3ft p — > 3? parameterized by a vector of 
parameters ^ 6 (1 C ^ determine the parameter vector 6*, such that fg* fits the data best. The 
linear models are 

Vi = xaOx + . . . + x ip 9p + £ h i = l,...,n, (3) 

with x iv = 1 for regression with an intercept term. {:%} =16 9ft nxp is the matrix of explanatory 
variables and e is an n-vector of iid random errors with zero mean and unknown variance. 

The goodness of fit is expressed in terms of either squared or absolute residuals = fe(xi)—yi, 
namely the weighted averages J2i=i w i r 1 ( me least squares (LS) regression), or Ym=i w i\ r 'i\ ( me 
least absolute deviations (LAD) regression). 

The LS and LAD problems are not robust to outliers: it is sufficient to take just one contami- 
nated point with an extremely large or small value of or Xij to break the regression model. The 
breakdown point of the LS and LAD estimators, i.e., the proportion of data that can make the 
estimator's bias arbitrarily large, is (see books [|T9l , [|27l ). To overcome the lack of robustness 
of the LS and LAD estimators, Rousseeuw [|25l introduced the Least Median of Squares (LMS) 
estimator based on the solution of 

Minimize F(8) = Medij-^6)) 2 . 

He also introduced the method of the least trimmed squares (LTS) ll25l . which is considered 
superior to the LMS ll26l . Il28l . Here the expression to be minimized is 

h 

Minimize F{0) = ^(r w (#)) 2 , 
i=i 

where the residuals are ordered in the increasing order |rm| < |rv 2 )| < . . . < \r( n )\, and h = 
[(n + p)/2]. 

In essence, in the LMS and LTS methods, half of the sample is discarded as potential 
outliers, and the model is fitted to the remaining half. This makes the estimator not sensitive to 
contamination in up to a half of the data. Of course, the problem is to decide which half of the 
data should be discarded, and this is very challenging. 

Numerical computation of the LMS or LTS estimator involves multiple evaluations of F{6). 
We clearly see the need in an efficient evaluation of the median in the LMS method, which is 
the motivation for this work. Next we also show that the median can simplify calculation of the 
LTS objective, even though it appears that the vector of residuals needs to be fully sorted. 

We rewrite the LTS objective in this form 



n 

Minimize F{9) = $>(|r (i) (#)| 2 ), 
i=i 



(4) 
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with 

1 if t < Med(r) 
p(t) = I f if t = Med(r) 
I otherwise 

The integers a and b are computed from the multiplicity of the median, i.e., the number of 
components of r equal to the median. Consider the numbers b L = count(\ri\ < Med(r)) and 
b = count(\ri\ = Med(r)). In the LTS method take h = for odd n, and h = § for even n. 
Then h = bL + a for some a < b. Now if we take a and b calculated in this way in ©, the value 
of F in © will be the sum of exactly h smallest residuals, and will coincide with F in the LTS 
method. The values of a and b are easily calculated once the median is known, on either CPU 
(in 0(n) time) or GPU (in 0(- + logp) time) by reduction. 

Hence, calculation of the median is also valuable for the LTS method, where partial sorting 
of the absolute residuals can be replaced with a cheaper method based on the median. 

Another application which requires multiple calculations of order statistics, and which is easily 
parallelized for GPUs is the k-nearest neighbor method (kNN), which is used in both regression 
and classification. In the case of regression, it consists in approximating a function value at x 
by /(x) = 5]. jiBj/j, where the values fa are the ordinates of the k points xi, . . . , closest 
to x (in some metric, usually the Euclidean distance) and Wi are the weights that are decreasing 
functions of the distances di = ||x — x»||. In the case of classification, the majority vote (among 
the k nearest neighbors) is applied to determine the predicted class of the query point x. 

The kNN method is suitable for parallelization on GPUs. The array of distances d, can be 
calculated on GPUs in parallel in O(-) time. The usual approach to selecting the k nearest 
neighbors is to sort the data according to the distance to x. However, one can do better by using 
the k-th order statistic. Indeed, by adapting the function p in (HJ), we obtain an indicator function, 
which returns a non-zero value for those data (xj, fa) that are no further from x than the A;-order 
statistic dfk). Then the weighted sum of k nearest neighbors is calculated by reduction. Thus, 
efficient parallel calculations of fc-order statistics is useful in the kNN method as well. 



VII. Conclusion 

We considered a classical selection problem and discussed various approaches to parallel 
computation of the median and order statistics on GPUs. We compared numerical performance of 
several existing and proposed methods by using a number of data sets with standard and unusual 
distributions, and found that the most efficient way to calculate the median is by minimizing 
a convex function using the cutting plane method, followed by sorting a reduced sample. This 
approach is between three and six times more efficient than the fastest existing method based 
on GPU radix sort, depending on the data type used. 

The proposed algorithm is easy to implement using high-level interface to GPU programming 
implemented in Thrust library, which is bundled with the newest CUDA 4.0 release. It uses 
a standard reduction operation, which is efficiently implemented in Thrust. Furthermore, our 
approach is scalable to multiple GPU devices, as only a very small number of transfers of data 
between GPU and CPU is required. 
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We outlined two applications in data analysis which require an efficient calculation of the 
medians and order statistics. Both applications are easily parallelized for GPUs, and benefit 
from an efficient GPU-based selection algorithm. 
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typedef thrust :: pair<double , unsigned int> Mypair; 
template <typename T, typename Tl> 

struct abs_diff : public thrust :: unary_function<T, Tl> 
{ 

const double y; 

abs_diff (double _a) : y(_a) {} 

host device 

Tl operator () (const T& a) { 
if (y >= a) return Mypair (y-a, 1 ) ; 

else return Mypair (a-y, ) ; 

} 

}; 

template <typename T> 

struct plusplus : public thrust :: binary_function<T, T, T> 
{ 

host device 

T operator () (const T& a, const T& b) 

{ return Mypair (a . first+b . first, a . second+b . second) ; } 

}; 

void Ob jective (int n, double t, double * f, double * df) 
{ 

/* calculates the values of the objective and its subgradient */ 
abs_dif f <double , Mypair> unary_op(t); 
plusplus<Mypair> binary_op; 

Mypair initpair ( . , ) ; 
Mypair result = 

thrust : : t r an sform_r educe (data . begin ( ) , data . end ( ) , 
unary_op, initpair, binary_op) ; 
*df = 2 . *result . second-n; 
*f = result . first ; 

} 

void SortZ (double * result, double L, double R, int index) 
/* Copies the data satisfying L<data[i]<R into Z and returns the n/2-index 
order statistic of Z after sorting. */ 

{ 

inside_interval pred(L, R) ; 
Doublelterator endZ = 

thrust : : copy_if (data .begin ( ) , data . end ( ) , Z . begin ( ) , pred) ; 

thrust : : sort ( Z . begin ( ) , endZ ) ; 
*result=Z [n/2-index] ; 

} 

Fig. 1: A fragment of C code for evaluation of the objective f{t) and its subgradient g(t) in 
parallel by using Thrust transf orm_reduce function. The function Objective is called 
by Algorithm 1, and once it terminates after maxit iterations, the function SortZ calculates 
the median of x by sorting array z containing the data in the pivot interval [L,R]. 
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TABLE I: The mean time (ms) for each method used to calculate the median: data type float. 
The best timing is in boldface. 











Size of the data set n 






Method 


8192 


32768 


131072 


524288 


2097152 


8388608 


33554432 


134 x 10 6 


JVULllA OUll ^Ull uruj 






94 


1.93 


4.87 


17.07 


UJ.UO 


282.10 


Quickselect (on CPU) 


0.21 


0.93 


3.75 


14.62 


57.49 


235.85 


940.1 




including: 


















- copy to CPU 


0.1 


0.3 


1.1 


4 


14.5 


58.5 


232 




- algorithm 


0.11 


0.63 


2.65 


10.62 


42.99 


177.35 


708.1 




Quickselect (on GPU) 


4.88 


21.00 


80.26 


326.59 


1311.60 


5530 


21951 




Cutting Plane (total) 


0.70 


0.80 


1.04 


3.29 


4.32 


8.34 


24.48 


88.00 


including: 


















- CP iterations 


0.29 


0.32 


0.49 


2.40 


2.90 


4.80 


14.20 


62.00 


- copy_if 


0.19 


0.22 


0.25 


0.47 


0.61 


1.64 


5.98 


13.90 


- Radix sort of z 


0.22 


0.26 


0.30 


0.42 


0.81 


1.90 


4.30 


12.10 


Bisection 


0.77 


1.11 


1.44 


3.44 


10.7 


51.6 


216.4 




Brent's minimization 


0.97 


1.12 


1.4 


2.25 


5.85 


20.75 


86.4 




Brent's nonlinear eqn 


0.68 


0.79 


0.87 


2.34 


3.92 


8.79 


31.25 





TABLE II: The mean time (ms) for each method used to calculate the median: data type double. 
The best timing is in boldface. 











Size of the data set n 






Method 


8192 


32768 


131072 


524288 


2097152 


8388608 


33554432 


134 x 10 6 


Radix Sort (on GPU) 


0.45 


0.59 


1.98 


4.65 


15.17 


57.52 


226.39 


820.76 


Quickselect (on CPU) 


0.29 


1.12 


4.52 


17.12 


68.11 


280.83 


1107.82 




including: 

- copy to CPU 


0.1 


0.48 


1.74 


6.1 


19.1 


64 


455 




- algorithm 


0.19 


0.64 


2.78 


11.02 


49.01 


216.83 


652.8 




Quickselect (on GPU) 


5.25 


20.72 


87.10 


358.58 


1429.4 


6175.6 


24056.7 




Cutting Plane (total) 


1.00 


1.10 


3.24 


3.68 


5.26 


11.95 


37.84 


139.84 


including: 

- CP iterations 


0.44 


0.5 


2.10 


2.36 


3.20 


6.80 


23.00 


94.00 


- copy_if 


0.28 


0.29 


0.72 


0.83 


0.96 


2.95 


9.74 


22.84 


- Radix sort of z 


0.28 


0.31 


0.42 


0.49 


1.10 


2.20 


5.10 


23.00 


Bisection 


1.02 


1.24 


4.12 


5.41 


12.1 


61.2 


298 




Brent's minimization 


1.01 


1.19 


3.12 


4.02 


5.45 


26.75 


92.1 




Brent's nonlinear eqn 


0.97 


1.04 


2.81 


3.25 


4.49 


11.14 


39.2 





16 




Fig. 2: Average time (in milliseconds, on a logarithmic scale) used by different methods to 
calculate the median, for data type float. 
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Fig. 3: Average time (in milliseconds, on a logarithmic scale) used by different methods to 
calculate the median, for data type double. 
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h,f 




Fig. 4: Illustration of the cutting plane algorithm. The thick solid line is the graph of the objective 
/.The minimizer of / is always bracketed by y L and y R , which are updated at every iteration. 
The point of intersection of two lines tangent to the graph of / yields the value y k+1 of the new 
iteration. 
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h,f 




Fig. 5: Non- sensitivity of the cutting plane algorithm to extreme values of Xi. Even if the right end 
of the interval [y L , y R ] is very large, the cutting plane algorithm avoids exploring the uninteresting 
interval [u,y R ], because on that interval / coincides with h 2 (y) — (y — yR)g(yR) + fiun), and 
the next iteration y 3 cannot be selected in [u, y R \. In contrast, bisection makes many iterations in 
[u, y R \, the larger y R , the more iterations in [u, y R \. Similarly, Brent's method attempts parabolic 
fit using 3 points from [u, y R ], but since these points are collinear, it reverts to golden section, 
which again explores [u,y R ]. 



